In the previous article, we introduced the producer-consumer model and briefly mentioned Apache Kafka as a representative implementation in the big data ecosystem.
In this article, we take a deeper look at Kafka itself.
Specifically, we explain what it is, how it works, and why it has become a foundational component in modern data architectures.
What Is Apache Kafka?
Apache Kafka is an open-source, distributed event streaming platform originally developed at LinkedIn and later donated to the Apache Software Foundation.
Today, it is widely used for real-time data pipelines, event-driven systems, and large-scale stream processing.
Unlike traditional message queues, Kafka is designed as a persistent distributed log.
As a result, messages are appended sequentially to disk and retained for a configurable period, rather than being deleted immediately after consumption.
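Retention is controlled per topic. As a minimal sketch (the topic name "events" and the broker address are placeholders), the standard kafka-configs.sh tool can cap retention at seven days or roughly one gigabyte per partition, whichever limit is reached first:

# Illustrative only: keep records for 7 days (604800000 ms) or until a partition
# reaches ~1 GiB (1073741824 bytes), whichever happens first.
kafka-configs.sh --alter \
--bootstrap-server localhost:9092 \
--entity-type topics \
--entity-name events \
--add-config retention.ms=604800000,retention.bytes=1073741824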
For official background and documentation, see:
- Apache Kafka Official Site: https://kafka.apache.org
- Kafka Documentation: https://kafka.apache.org/documentation/
Core Design Goals
Kafka was created to solve large-scale data flow problems that traditional queues could not handle efficiently.
Therefore, its design focuses on the following goals:
- Constant-time message persistence – writes and reads remain efficient even with terabytes of stored data.
- High throughput with low latency – on commodity hardware, Kafka can handle hundreds of thousands of messages per second.
- Parallel and distributed consumption – multiple consumers can read data concurrently while preserving order within partitions.
- Horizontal scalability – capacity increases simply by adding more brokers.
Key Capabilities Beyond Traditional Queues
In addition to its core goals, Kafka introduces several capabilities that distinguish it from classic messaging systems:
- Consumers can replay messages from any offset
- Message retention is time- or size-based, not consumption-based
- Replication ensures fault tolerance across broker failures
- Consumer groups enable scalable parallel processing
Because of these features, Kafka often acts as both a messaging system and a data backbone.
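As an illustration of replay, the sketch below (group and topic names are placeholders) uses the standard kafka-consumer-groups.sh tool to rewind a consumer group to the earliest retained offset, so its consumers reprocess the whole log the next time they start:

# Illustrative: reset the "analytics" group to the beginning of "events".
# Run while the group's consumers are stopped; omit --execute for a dry run.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group analytics \
--topic events \
--reset-offsets --to-earliest --execute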
Architecture Overview
Kafka follows the producer-consumer model, but with a distributed architecture optimized for scale.
Core Components
- Producer – Publishes records to topics
- Broker – Stores data and serves client requests
- Consumer – Reads records from topics
Messages are organized into topics, which are further divided into partitions.
Each partition is an ordered, append-only log stored on disk.
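Once a topic exists, its partition layout can be inspected with kafka-topics.sh; the output lists each partition's leader broker, replica set, and in-sync replicas (the topic name below is illustrative):

# Show how the partitions of "events" are distributed across brokers.
kafka-topics.sh --describe \
--topic events \
--bootstrap-server localhost:9092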
To better understand how this compares with other messaging systems, you may also read:
- [RabbitMQ and the Producer-Consumer Model](https://xx/RabbitMQ and the Producer-Consumer Model)
- [Redis and the Producer-Consumer Model](https://xx/Redis and the Producer-Consumer Model)
Topics, Partitions, and Segments
Although a topic is a logical concept, its data is physically split into partitions.
Each partition is stored as multiple segment files, which improves disk management and read performance.
Because partitions are independent, Kafka can scale throughput linearly as partitions increase.
Meanwhile, ordering is guaranteed within each partition.
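One way to observe per-partition ordering is to publish keyed records: Kafka hashes the key to choose a partition, so records that share a key land in the same partition in the order they were sent. A minimal sketch with the console producer (topic name and key separator are illustrative):

# Each input line is "key:value"; lines with the same key go to the same partition.
kafka-console-producer.sh --bootstrap-server localhost:9092 \
--topic events \
--property parse.key=true \
--property key.separator=: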
Distributed Deployment and Coordination
Kafka typically runs in a cluster.
Partitions are distributed across brokers, and replicas are placed on different nodes to avoid single points of failure.
Traditionally, ZooKeeper handles cluster metadata and leader election.
However, newer versions are moving to KRaft, Kafka’s built-in Raft-based consensus mechanism, which removes the ZooKeeper dependency.
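As a rough sketch of a KRaft-mode setup (the config path is the sample shipped with the Kafka distribution; adjust it for your installation), the node's roles and controller quorum are declared in server.properties, and the storage directory is formatted once before the first start:

# Generate a cluster ID, format the log directories declared in the sample
# KRaft config, then start the broker. Illustrative single-node setup.
KAFKA_CLUSTER_ID=$(kafka-storage.sh random-uuid)
kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
kafka-server-start.sh config/kraft/server.properties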
Getting Started Quickly
Before producing or consuming data, create a topic (a replication factor of 3 requires a cluster with at least three brokers; use 1 for a single-node setup):
kafka-topics.sh --create \
--topic test-topic \
--bootstrap-server localhost:9092 \
--partitions 3 \
--replication-factor 3
After that, you can produce and consume messages using either client libraries or CLI tools.
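For a quick check from the command line, the bundled console tools work against the topic created above; the producer turns each typed line into a record, and the consumer reads the log from its earliest retained offset:

# Terminal 1: type messages; each line becomes one record in test-topic.
kafka-console-producer.sh --bootstrap-server localhost:9092 \
--topic test-topic

# Terminal 2: read test-topic from the beginning of the log.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic test-topic \
--from-beginning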
Common Use Cases
Kafka is now a core building block in many production systems.
Real-Time Data Processing
It serves as the ingestion layer for engines such as Flink and Spark Streaming, enabling real-time analytics.
Event-Driven Architecture
Services publish events to topics, while downstream systems react asynchronously.
Centralized Log Collection
Kafka reliably aggregates logs from distributed services for analysis and monitoring.
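As a minimal illustration of the pattern (file path and topic name are placeholders), new lines of an application log can be piped straight into a dedicated topic; in production, dedicated collection agents usually play this producer role, but the data flow is the same:

# Ship each new log line to the "app-logs" topic as an individual record.
tail -F /var/log/app/app.log | \
kafka-console-producer.sh --bootstrap-server localhost:9092 \
--topic app-logs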
Conclusion
Kafka is not just another message queue.
Instead, it is a distributed event streaming platform designed for high throughput, durability, and scalability.
By combining persistent storage with parallel consumption, Kafka has become a cornerstone of modern data infrastructure.
In the next article, The Design Philosophy of Kafka, we will explore the engineering decisions that enable its performance and reliability at scale.